Conversation
design-proposals/observability-modular-hw-metrics-collection.md
> To support collection of additional HW metrics from GPU, PMU, cache utilization, etc., the current POA implementation will be expanded to include new metrics collectors for these HW components. Also, modifications will be made to the Edge Node Observability pipeline deployment in the orchestrator to allow it to be deployable as a standalone pipeline without requiring other components from the EMF stack.
> without requiring other components from the EMF stack

Which components are these?
> - **BIOS Metrics**: One option for these metrics is to use the [Telegraf redfish collector](https://github.com/influxdata/telegraf/tree/master/plugins/inputs/redfish) to retrieve thermal and power settings.
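For reference, the redfish input plugin is configured with the BMC's Redfish endpoint and the ID of the system to monitor. The address, credentials, and system ID below are placeholders, not values from the proposal:

```toml
# Telegraf redfish input plugin -- placeholder values, adapt to the target BMC
[[inputs.redfish]]
  ## Redfish API endpoint of the BMC (placeholder)
  address = "https://192.0.2.10"
  username = "root"
  password = "changeme"
  ## ID of the computer system to monitor (see /redfish/v1/Systems)
  computer_system_id = "System.Embedded.1"
```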
> ##### New Metrics to Configure and Enable
Each of these will require additional permissions to be enabled on the device side and will likely increase resource utilization. Is there any QoS currently enabled to support existing and additional metrics collection? e.g., best-effort collection?
> Instead, the Orchestrator Command Line Interface (CLI) tool will be extended to provide commands for a user to run to query the Mimir backend for metrics.
> The CLI will receive a command containing the metric to be queried for, the edge node to be checked as well as
To clarify: is the query going to the orchestrator or to the edge node? Or, if the requested data is not found on the orchestrator side, will the query then be sent to the edge node?
> to run to query the Mimir backend for metrics.
> The CLI will receive a command containing the metric to be queried for, the edge node to be checked, as well as any time range required by the user. If a time range is not provided, then the CLI should use a default time range,
Maybe a concern that is already addressed, but how is clock synchronization ensured across the orchestrator and edge devices, so that the requested time range means the same thing on every device and there is no offset?
> The CLI will receive a command containing the metric to be queried for, the edge node to be checked, as well as any time range required by the user. If a time range is not provided, then the CLI should use a default time range, such as the last 5 minutes. The CLI should also support retrieving both averages and sums for metrics over set time periods.
Is the request made for a single node or for all the nodes? How does this scale?
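As a minimal sketch of how such a CLI command could map to a PromQL expression: the function and label names below are hypothetical illustrations, not part of the proposal. It covers the default 5-minute range and the average/sum aggregation modes described above:

```python
# Hypothetical sketch: map CLI arguments to a PromQL range-aggregation query.
# The metric name, the "node" label, and the flag names are illustrative only.
from typing import Optional

def build_promql(metric: str, node: str,
                 time_range: Optional[str] = None,
                 agg: str = "avg") -> str:
    """Build a PromQL query for one edge node.

    agg is "avg" or "sum"; time_range defaults to the last 5 minutes
    when the user does not supply one.
    """
    if agg not in ("avg", "sum"):
        raise ValueError(f"unsupported aggregation: {agg}")
    window = time_range or "5m"              # default range per the proposal
    selector = f'{metric}{{node="{node}"}}'  # select a single edge node
    return f"{agg}_over_time({selector}[{window}])"
```

For example, `build_promql("cpu_temp", "edge-1")` yields `avg_over_time(cpu_temp{node="edge-1"}[5m])`.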
> periods.
> Within the CLI, it should convert the received query into the PromQL format needed for querying Mimir and then send the PromQL query to the Mimir API. When the CLI receives the metrics back from Mimir,
What happens if, for some reason, metrics collection fails or becomes unavailable during the requested time range? What if the data is only partially available?
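The handoff to Mimir could look roughly like the following sketch. The `/prometheus/api/v1/query` path is Mimir's Prometheus-compatible HTTP API and `X-Scope-OrgID` is its multi-tenancy header; the base URL and tenant ID are deployment-specific placeholders:

```python
# Hypothetical sketch of sending a PromQL query to Mimir's
# Prometheus-compatible HTTP API. Base URL and tenant ID are placeholders.
import json
import urllib.parse
import urllib.request

MIMIR_BASE = "http://mimir.example.internal:8080"  # placeholder endpoint

def mimir_query_url(promql: str) -> str:
    """URL-encode an instant query against the Prometheus-compatible API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{MIMIR_BASE}/prometheus/api/v1/query?{params}"

def run_query(promql: str, tenant: str = "edge") -> dict:
    """Send the query; multi-tenant Mimir expects an X-Scope-OrgID header."""
    req = urllib.request.Request(
        mimir_query_url(promql),
        headers={"X-Scope-OrgID": tenant},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```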
> - Provide documentation on how to install the modular observability workflow.
> - Extend Orchestrator CLI documentation with new commands for metrics querying.

> ## Opens
Has support for multi-vendor environments been considered?
> ## Implementation Plan

> - Hardware Metrics Collection.
>   - Identify the new hardware metrics collectors to be added to the current edge node metrics service.
What are the performance implications when additional metrics are collected, both on the orchestrator side and across the network?
> pipeline and can be used to configure what metrics an edge node reports after it has been deployed without requiring a full redeployment or access to the edge node. For modular deployments, should this also be included and used for this purpose, or should it be excluded?
> - Investigate the [Intel Performance Counter Monitor (PCM)](https://github.com/intel/pcm) tool as there may be
What happens if the hardware metrics collection fails, e.g., due to a hardware malfunction, a misconfiguration, or a disruption to the sensors doing the readings?
Description
Please include a summary of the changes and the related issue. List any dependencies that are required for this change.
Fixes # (issue)
Any Newly Introduced Dependencies
Please describe any newly introduced 3rd party dependencies in this change. List their name, license information and how they are used in the project.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist: